CSSS/SOC/STAT 321: Lab 7

Linear Regression: Nonlinear Terms

Tao Lin

Office Hours: Fri 1:30 - 3:30 PM Smith 35

Section Slides URL: soxv/CSSS-321-Labs

Goals for Week 7

  • QSS Tutorial 6
  • Key Question in Week 7
    • How can we add and interpret categorical terms in a linear regression model?
    • How can we add and interpret interaction terms in a linear regression model?
    • How can we diagnose and add non-linear terms in a linear regression model?
    • How can we use regression models and counterfactual scenarios to predict the outcome variable?
  • Midterm Exam

Categorical Terms

Agenda

  • Categorical Terms

  • Interaction Terms

  • Non-linear Terms

  • Which Term to Add?

  • Midterm

  • If X is a binary predictor, we have shown that \beta_1 is the average effect of X on Y.

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

  • If our explanatory variable is a categorical variable with more than 2 categories, we need to “binarize” it into multiple binary variables, each of which denotes one category.
    • Optional: if the categorical variable is of numeric type in R, we need to transform it into a factor or character variable.
    • choose one category as the baseline/reference category
    • code each of the other categories as a binary variable, called a dummy variable.
ID Group
1 Group 1
2 Group 2
3 Group 3
4 Group 4

\Rightarrow

ID Group 2 Group 3 Group 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
  • Interpretation of the coefficients on categorical variables
    • \beta_0: group mean of the reference category
    • \beta_j (j \neq \text{ref}): the average difference between category j and the reference category
      • \beta_0 + \beta_j: group mean of category j

Y_i = \beta_0 + \beta_1 \cdot \mathbb{I}(X_i = C_2) + \beta_2 \cdot \mathbb{I}(X_i = C_3) + \beta_3 \cdot \mathbb{I}(X_i = C_4) + \varepsilon_i
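As a quick sketch with a toy vector (hypothetical data, not the qss data), the dummy coding above can be done by hand with logical comparisons, and it matches what `model.matrix()` builds with the first factor level as the reference:

```r
# toy categorical variable; "C1" becomes the reference category
grp <- factor(c("C1", "C2", "C3", "C4", "C2"))

# hand-coded dummy variables for the non-reference categories
d2 <- as.numeric(grp == "C2")
d3 <- as.numeric(grp == "C3")
d4 <- as.numeric(grp == "C4")

# model.matrix() builds the same columns (plus an intercept column)
mm <- model.matrix(~ grp)
all(mm[, "grpC2"] == d2, mm[, "grpC3"] == d3, mm[, "grpC4"] == d4)  # TRUE
```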

data(social, package = "qss")
social <- unique(social)

# categorical variable
social$messages |> head(n = 10)
 [1] "Civic Duty" "Civic Duty" "Hawthorne"  "Hawthorne"  "Hawthorne" 
 [6] "Control"    "Control"    "Control"    "Control"    "Control"   
# binarization/one-hot encoding inside the `lm()` function
model.matrix(~ messages, data = social) |> head() # allow intercept
  (Intercept) messagesControl messagesHawthorne messagesNeighbors
1           1               0                 0                 0
2           1               0                 0                 0
3           1               0                 1                 0
4           1               0                 1                 0
5           1               0                 1                 0
6           1               1                 0                 0
model.matrix(~ -1 + messages, data = social) |> head() # drop intercept to recover reference category
  messagesCivic Duty messagesControl messagesHawthorne messagesNeighbors
1                  1               0                 0                 0
2                  1               0                 0                 0
3                  0               0                 1                 0
4                  0               0                 1                 0
5                  0               0                 1                 0
6                  0               1                 0                 0
lm(primary2006 ~ messages, data = social) # the intercept is the mean of the reference category (Civic Duty); each coefficient is the average difference between that group and the reference

Call:
lm(formula = primary2006 ~ messages, data = social)

Coefficients:
      (Intercept)    messagesControl  messagesHawthorne  messagesNeighbors  
          0.45590            0.01629            0.01170            0.02352  
lm(primary2006 ~ -1 + messages, data = social) # if we drop the intercept, each coefficient can be directly interpreted as the mean of its group

Call:
lm(formula = primary2006 ~ -1 + messages, data = social)

Coefficients:
messagesCivic Duty     messagesControl   messagesHawthorne   messagesNeighbors  
            0.4559              0.4722              0.4676              0.4794  
tapply(social$primary2006, social$messages, mean, na.rm = TRUE) # dropping intercept is equivalent to using `tapply()` to compute group means
Civic Duty    Control  Hawthorne  Neighbors 
 0.4559029  0.4721919  0.4676056  0.4794211 
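A related sketch (toy data; `relevel()` is base R): changing the reference category only relabels the coefficients, while the implied group means stay the same:

```r
# toy data: three groups with means 1.25, 2.25, 3.25
grp <- factor(rep(c("A", "B", "C"), times = 2))
y <- c(1, 2, 3, 1.5, 2.5, 3.5)

fit_a <- lm(y ~ grp)                       # reference category: "A"
fit_c <- lm(y ~ relevel(grp, ref = "C"))   # reference category: "C"

# both parameterizations recover the same group means
unname(coef(fit_a)[1])                   # mean of A: 1.25
unname(coef(fit_c)[1])                   # mean of C: 3.25
unname(coef(fit_a)[1] + coef(fit_a)[3])  # mean of C again: 3.25
```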

Interaction Terms

Agenda

  • Categorical Terms

  • Interaction Terms

  • Non-linear Terms

  • Which Term to Add?

  • Midterm

Binary + Binary

res <- lm(
  primary2006 ~ primary2004 + messages + primary2004:messages,
  data = social,
  subset = messages %in% c("Control", "Neighbors")
) # or we can specify `lm(primary2006 ~ primary2004 * messages, ...)`
res

Call:
lm(formula = primary2006 ~ primary2004 + messages + primary2004:messages, 
    data = social, subset = messages %in% c("Control", "Neighbors"))

Coefficients:
                  (Intercept)                    primary2004  
                     0.452678                       0.038677  
            messagesNeighbors  primary2004:messagesNeighbors  
                     0.002786                       0.008534  

\begin{aligned} \texttt{primary2006} =& \underbrace{0.452678}_{\beta_0} + \underbrace{0.038677}_{\beta_1} \times \texttt{primary2004} + \underbrace{0.002786}_{\beta_2} \times \texttt{messagesNeighbors} \\ & + \underbrace{0.008534}_{\beta_3} \times \texttt{primary2004:messagesNeighbors} \end{aligned}

Control (\texttt{messagesNeighbors} = 0) Neighbors (\texttt{messagesNeighbors} = 1)
Didn’t Vote in 2004 (\texttt{primary2004} = 0) \beta_0 \beta_0 + \beta_2
Voted in 2004 (\texttt{primary2004} = 1) \beta_0 + \beta_1 \beta_0 + \beta_1 + \beta_2 + \beta_3
# create hypothetical scenarios to get predictions
hypo_dat <- expand.grid(
  primary2004 = c(0, 1),
  messages = c("Control", "Neighbors")
)
hypo_dat$lm_pred <- predict(res, newdata = hypo_dat)
hypo_dat
  primary2004  messages   lm_pred
1           0   Control 0.4526779
2           1   Control 0.4913545
3           0 Neighbors 0.4554637
4           1 Neighbors 0.5026738
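A simulated check of the table above (toy data, not the social experiment): in a saturated binary × binary model, the four predicted values are exactly the four cell means.

```r
set.seed(321)  # toy data for illustration
d1 <- rbinom(200, 1, 0.5)
d2 <- rbinom(200, 1, 0.5)
y  <- 0.3 + 0.10 * d1 + 0.05 * d2 + 0.08 * d1 * d2 + rnorm(200, sd = 0.1)

fit  <- lm(y ~ d1 * d2)
grid <- expand.grid(d1 = 0:1, d2 = 0:1)
pred <- predict(fit, newdata = grid)

# cell means in the same (d1, d2) order as expand.grid()
cell_means <- as.vector(tapply(y, list(d1, d2), mean))
all.equal(unname(pred), cell_means)  # TRUE
```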

Binary + Continuous

social$age <- 2006 - social$yearofbirth
res <- lm(
  primary2006 ~ age + messages + age:messages,
  data = social,
  subset = messages %in% c("Control", "Neighbors")
)
res

Call:
lm(formula = primary2006 ~ age + messages + age:messages, data = social, 
    subset = messages %in% c("Control", "Neighbors"))

Coefficients:
          (Intercept)                    age      messagesNeighbors  
            0.5046521             -0.0005985             -0.0344851  
age:messagesNeighbors  
            0.0007726  

\begin{aligned} \texttt{primary2006} =& \underbrace{0.504652}_{\beta_0} - \underbrace{0.000599}_{\beta_1} \times \texttt{age} - \underbrace{0.034485}_{\beta_2} \times \texttt{messagesNeighbors} \\ & + \underbrace{0.000773}_{\beta_3} \times \texttt{age:messagesNeighbors} \\ =& (\beta_0 + \beta_1 \times \texttt{age}) + \texttt{messagesNeighbors} \times (\beta_2 + \beta_3 \times \texttt{age}) \\ =& \begin{cases} \beta_0 + \beta_1 \times \texttt{age} & \text{if } \texttt{messagesNeighbors} = 0 \\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \times \texttt{age} & \text{if } \texttt{messagesNeighbors} = 1 \end{cases} \end{aligned}

hypo_dat <- expand.grid(
  age = seq(from = 25, to = 85, by = 20),
  messages = c("Control", "Neighbors")
)
hypo_dat$primary2006_pred <- predict(res, newdata = hypo_dat)
hypo_dat
  age  messages primary2006_pred
1  25   Control        0.4896886
2  45   Control        0.4777178
3  65   Control        0.4657470
4  85   Control        0.4537763
5  25 Neighbors        0.4745173
6  45 Neighbors        0.4779976
7  65 Neighbors        0.4814778
8  85 Neighbors        0.4849581
plot(primary2006_pred ~ age, type = "l", col = "blue", data = subset(hypo_dat, messages == "Control"))
lines(primary2006_pred ~ age, type = "l", col = "red", data = subset(hypo_dat, messages == "Neighbors"))
legend("bottomright", title = "Messages", c("Control", "Neighbors"), lty = 1, col = c("blue", "red"))

Continuous + Continuous

res <- lm(
  primary2006 ~ age + hhsize + age:hhsize,
  data = social
)
res

Call:
lm(formula = primary2006 ~ age + hhsize + age:hhsize, data = social)

Coefficients:
(Intercept)          age       hhsize   age:hhsize  
  0.5830103   -0.0014437   -0.0368314    0.0004591  

\begin{aligned} \texttt{primary2006} =& \underbrace{0.583010}_{\beta_0} - \underbrace{0.001444}_{\beta_1} \times \texttt{age} - \underbrace{0.036831}_{\beta_2} \times \texttt{hhsize} \\ & + \underbrace{0.000459}_{\beta_3} \times \texttt{age:hhsize} \\ =& (\beta_0 + \beta_1 \times \texttt{age}) + \texttt{hhsize} \times (\beta_2 + \beta_3 \times \texttt{age}) \\ =& (\beta_0 + \beta_2 \times \texttt{hhsize}) + \texttt{age} \times (\beta_1 + \beta_3 \times \texttt{hhsize}) \end{aligned}

hypo_dat <- expand.grid(
  age = c(25, 45, 65),
  hhsize = c(2, 3, 4)
)
hypo_dat$primary2006_pred <- predict(res, newdata = hypo_dat)
hypo_dat
  age hhsize primary2006_pred
1  25      2        0.4962083
2  45      2        0.4856969
3  65      2        0.4751856
4  25      3        0.4708540
5  45      3        0.4695244
6  65      3        0.4681948
7  25      4        0.4454998
8  45      4        0.4533520
9  65      4        0.4612041
plot(primary2006_pred ~ age, type = "l", col = "blue", data = subset(hypo_dat, hhsize == 2))
lines(primary2006_pred ~ age, type = "l", col = "green", data = subset(hypo_dat, hhsize == 3))
lines(primary2006_pred ~ age, type = "l", col = "red", data = subset(hypo_dat, hhsize == 4))
legend("bottomright", title = "Household Size", c("2", "3", "4"), lty = 1, col = c("blue", "green", "red"))

plot(primary2006_pred ~ hhsize, type = "l", col = "blue", data = subset(hypo_dat, age == 25), ylim = c(0, 0.5))
lines(primary2006_pred ~ hhsize, type = "l", col = "green", data = subset(hypo_dat, age == 45))
lines(primary2006_pred ~ hhsize, type = "l", col = "red", data = subset(hypo_dat, age == 65))
legend("bottomright", title = "Age", c("25", "45", "65"), lty = 1, col = c("blue", "green", "red"))

Sometimes a continuous covariate can never take the value zero. This unrealistic “zero” scenario makes the interpretation of the intercept and main effects nonsensical. To ease interpretation, we can center our covariates, that is,

Y_i = \beta_0 + \beta_1 (X_i - \bar{X}) + \beta_2 (Z_i - \bar{Z}) + \beta_3 (X_i - \bar{X}) (Z_i - \bar{Z})

  • \beta_0: the value of Y when both X and Z equal their means.
  • \beta_1: the average effect of X on Y when Z equals its mean.
  • \beta_2: the average effect of Z on Y when X equals its mean.
  • \beta_3:
    • If X increases by 1 unit, the average effect of Z on Y increases by \beta_3 units.
    • If Z increases by 1 unit, the average effect of X on Y increases by \beta_3 units.

The resulting coefficients will differ, but the predicted values of the outcome variable remain unchanged.
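A minimal simulated check of this claim (toy data, not the social data):

```r
set.seed(321)
x <- rnorm(100, mean = 50)   # a covariate that is never near zero
z <- rnorm(100, mean = 5)
y <- 1 + 0.5 * x - 0.3 * z + 0.1 * x * z + rnorm(100)

raw      <- lm(y ~ x * z)
centered <- lm(y ~ I(x - mean(x)) * I(z - mean(z)))

# the coefficients differ ...
coef(raw)[1]; coef(centered)[1]
# ... but the fitted values are identical
all.equal(unname(fitted(raw)), unname(fitted(centered)))  # TRUE
```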

social$age_centered <- social$age - mean(social$age)
social$hhsize_centered <- social$hhsize - mean(social$hhsize)
res <- lm(
  primary2006 ~ age_centered + hhsize_centered + age_centered:hhsize_centered,
  data = social
)
res

Call:
lm(formula = primary2006 ~ age_centered + hhsize_centered + age_centered:hhsize_centered, 
    data = social)

Coefficients:
                 (Intercept)                  age_centered  
                   0.4726881                    -0.0002060  
             hhsize_centered  age_centered:hhsize_centered  
                  -0.0122714                     0.0004591  
hypo_dat <- expand.grid(
  age_centered = c(25, 45, 65) - mean(social$age),
  hhsize_centered = c(2, 3, 4) - mean(social$hhsize)
)
hypo_dat$primary2006_pred <- predict(res, newdata = hypo_dat)
# the predictions match those from the uncentered model above

Non-linear Terms

Agenda

  • Categorical Terms

  • Interaction Terms

  • Non-linear Terms

  • Which Term to Add?

  • Midterm

In-class Exercise

  1. Download and load the health2.csv data from Canvas.
  2. Subset the data set with the condition grepl("2017", date). That is, we only consider records in 2017.
  3. Regress weight on dayofyear, steps.lag, calorie.lag. We call it a restricted model.
  4. Plot residuals against each explanatory variable. Which non-linear term do we need to add?
  5. Rerun the regression model, but this time we add a quadratic term of dayofyear, an interaction term between dayofyear and calorie.lag, and an interaction term between I(dayofyear^2) and calorie.lag. We call it an unrestricted model.
  6. (Optional) Perform an F-test between restricted model and unrestricted model using anova().
  7. Use the following code to create several counterfactual scenarios
hypo_dat <- expand.grid(
  dayofyear = 1:365,
  calorie.lag = quantile(health$calorie.lag, c(0.25, 0.5, 0.75)),
  steps.lag = mean(health$steps.lag)
)
  8. Use the unrestricted model and counterfactual scenarios to predict weights.
  9. Plot predicted values (y-axis) against dayofyear (x-axis) when calorie.lag equals the first quartile, median, and third quartile, respectively.
# 1. load the data
health <- read.csv("./data/health2.csv")
# 2. subset the data to 2017
health <- subset(health, grepl("2017", date))
# 3. run regression
res <- lm(weight ~ dayofyear + steps.lag + calorie.lag, data = health)
# 4. plot residuals against each explanatory variable
par(mfrow = c(1, 3))
plot(x = health$dayofyear, y = residuals(res))
plot(x = health$steps.lag, y = residuals(res))
plot(x = health$calorie.lag, y = residuals(res))

# 5. re-run regression with quadratic and interaction terms
quad_res <- lm(weight ~ (dayofyear + I(dayofyear^2)) * calorie.lag + steps.lag, data = health)
# 6. F-test
anova(res, quad_res)
Analysis of Variance Table

Model 1: weight ~ dayofyear + steps.lag + calorie.lag
Model 2: weight ~ (dayofyear + I(dayofyear^2)) * calorie.lag + steps.lag
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    210 3870.4                                  
2    207 1158.1  3    2712.2 161.59 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 7. create counterfactual scenarios
hypo_dat <- expand.grid(
  dayofyear = 1:365,
  calorie.lag = quantile(health$calorie.lag, c(0.25, 0.5, 0.75)),
  steps.lag = mean(health$steps.lag)
)
# 8. Use the unrestricted model and hypothetical scenario to predict weights
hypo_dat$weight_pred <- predict(quad_res, hypo_dat)
# 9. Plot predicted values under different dayofyear when calorie.lag equals to the first quartile, median, and the third quartile, respectively.
plot(weight_pred ~ dayofyear, type = "l", col = "blue", data = subset(hypo_dat, calorie.lag == quantile(health$calorie.lag, 0.25)))
lines(weight_pred ~ dayofyear, type = "l", col = "green", data = subset(hypo_dat, calorie.lag == quantile(health$calorie.lag, 0.5)))
lines(weight_pred ~ dayofyear, type = "l", col = "red", data = subset(hypo_dat, calorie.lag == quantile(health$calorie.lag, 0.75)))
legend("bottomright", title = "Active Calorie", c("1st Quartile", "Median", "3rd Quartile"), lty = 1, col = c("blue", "green", "red"))

Which Term to Add?

Agenda

  • Categorical Terms

  • Interaction Terms

  • Non-linear Terms

  • Which Term to Add?

  • Midterm

Diagnostics Based on Prediction and Precision

  • Scatter plot between residuals \hat{\varepsilon} and explanatory variable X
    • check whether there is an unexplained pattern in the data.
  • R^2 and F test (Not required currently)
    • If adding an additional term increases the predictive power, the sum of squared residuals in the new (unrestricted) model should be smaller than that of the old (restricted) model.

\text{F-statistic} = \frac{\text{(scaled) reduction in prediction error}}{\text{(scaled) remaining prediction error}} = \frac{(\text{RSS}_\text{restricted} - \text{RSS}_\text{unrestricted}) / q}{\text{RSS}_\text{unrestricted} / (n - k - 1)} \sim F_{q, n - k - 1}
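The formula can be checked by hand against `anova()` with simulated toy data (here `q = 1` because one term is added):

```r
set.seed(321)
x <- rnorm(100)
y <- 1 + x + 0.5 * x^2 + rnorm(100)

restricted   <- lm(y ~ x)
unrestricted <- lm(y ~ x + I(x^2))

rss_r <- sum(residuals(restricted)^2)
rss_u <- sum(residuals(unrestricted)^2)
q     <- 1                          # number of added terms: I(x^2)
df_u  <- df.residual(unrestricted)  # n - k - 1

F_stat <- ((rss_r - rss_u) / q) / (rss_u / df_u)
all.equal(F_stat, anova(restricted, unrestricted)$F[2])  # TRUE
```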

Add Terms Based on Causality

Good Control: Confounders

D \leftarrow X \rightarrow Y

A confounder is a common cause of both D and Y, e.g. whether a student is a senior can affect both whether she uses ChatGPT and her test score.

set.seed(98105)
X <- rnorm(1e3)
D <- 2 * X + rnorm(1e3)
Y <- 5 * D + 10 * X + rnorm(1e3)

lm(Y ~ D) |> summary() # biased estimate of direct effect

Call:
lm(formula = Y ~ D)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.771  -3.156   0.033   3.153  16.398 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.11747    0.14848  -0.791    0.429    
D            9.15366    0.06697 136.676   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.694 on 998 degrees of freedom
Multiple R-squared:  0.9493,    Adjusted R-squared:  0.9492 
F-statistic: 1.868e+04 on 1 and 998 DF,  p-value: < 2.2e-16
lm(Y ~ D + X) |> summary() # including the confounder X will correct the estimate

Call:
lm(formula = Y ~ D + X)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.87279 -0.65034  0.03951  0.65597  2.84164 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02544    0.03100   0.821    0.412    
D            4.97982    0.03147 158.264   <2e-16 ***
X           10.09999    0.06822 148.056   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9796 on 997 degrees of freedom
Multiple R-squared:  0.9978,    Adjusted R-squared:  0.9978 
F-statistic: 2.254e+05 on 2 and 997 DF,  p-value: < 2.2e-16

Bad Control: Post-treatment variables

D \rightarrow X \rightarrow Y

A post-treatment variable is a mediator between D and Y, e.g. a student’s study time can be affected by whether she uses ChatGPT, and her study time then influences her test score.

set.seed(98105)
D <- rnorm(1e3)
X <- 5 * D + rnorm(1e3)
Y <- 10 * X + rnorm(1e3)

lm(Y ~ D) |> summary()  # this recovers the total effect of D on Y (5 * 10 = 50)

Call:
lm(formula = Y ~ D)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.749  -6.495  -0.012   6.507  34.479 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.3957     0.3125   1.266    0.206    
D            49.4850     0.3057 161.866   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.883 on 998 degrees of freedom
Multiple R-squared:  0.9633,    Adjusted R-squared:  0.9633 
F-statistic: 2.62e+04 on 1 and 998 DF,  p-value: < 2.2e-16
lm(Y ~ D + X) |> summary()  # including the post-treatment variable X will mask the relationship between D and Y

Call:
lm(formula = Y ~ D + X)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.87279 -0.65034  0.03951  0.65597  2.84164 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02544    0.03100   0.821    0.412    
D            0.16053    0.15844   1.013    0.311    
X            9.97982    0.03147 317.170   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9796 on 997 degrees of freedom
Multiple R-squared:  0.9996,    Adjusted R-squared:  0.9996 
F-statistic: 1.384e+06 on 2 and 997 DF,  p-value: < 2.2e-16

Bad Control: Colliders

Y \rightarrow X \leftarrow D

A collider is a common consequence of both D and Y, e.g. a student’s evaluation of the TA is influenced by both whether the student uses ChatGPT and whether the student gets a good test score.

set.seed(98105)
D <- rnorm(1e3)
Y <- rnorm(1e3)             # Y is independent of D by construction
X <- 5 * D + Y + rnorm(1e3) # the collider depends on both D and Y

lm(Y ~ D) |> summary()  # D and Y have no correlation

lm(Y ~ D + X) |> summary()  # including the collider X will create information flows between D and Y

Call:
lm(formula = Y ~ D + X)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.266959 -0.060928  0.001974  0.059513  0.250430 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  0.0019968  0.0028121    0.710   0.4778    
D           -0.0256371  0.0143557   -1.786   0.0744 .  
X            0.9096635  0.0002589 3513.841   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08885 on 997 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.683e+08 on 2 and 997 DF,  p-value: < 2.2e-16

Shutting the Backdoor

  • Include all pre-treatment variables in regression models.
  • Control for as many confounders as possible.
  • Don’t include colliders or other post-treatment variables.

Midterm

Agenda

  • Categorical Terms

  • Interaction Terms

  • Non-linear Terms

  • Which Term to Add?

  • Midterm